Loom test for deadlock observed in tokio's test suite #6876
Conversation
…on_queue_depth_multi_thread test
Yep, that looks like it catches the bug.
Given Carl's suggestion here, I'm wondering if the loom test will succeed or not with one worker thread.
I think the bug requires at least one worker thread. I think the solution is to make sure that, when the runtime is started, at least one of the worker threads is in the searching state.
If we have only one worker thread and it were to deterministically always run the tasks in the order in which they are spawned, I'd expect the test to always deadlock. Even if it runs the tasks non-deterministically, I believe a deadlock would happen with a lot more frequency (still every time the first task runs before the second) than it does with two or more workers. The deadlock, as far as I can see, happens when the second worker is parked while the second task (which, when executed, would unlock the deadlock) is still in some queue elsewhere, and the first worker is blocked by the first task, unable to notify the second worker to unpark again.
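To illustrate, here is a minimal sketch of the kind of pattern I have in mind (my own example using a barrier on a two-worker runtime, assuming the rt-multi-thread feature; this is not the code from this PR or from the flaky test):

```rust
use std::sync::{Arc, Barrier};

fn main() {
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(2)
        .build()
        .unwrap();

    // Barrier for the two spawned tasks plus the main thread.
    let barrier = Arc::new(Barrier::new(3));

    for _ in 0..2 {
        let barrier = barrier.clone();
        rt.spawn(async move {
            // Blocks the worker *thread*, not just the task.
            barrier.wait();
        });
    }

    // This can hang: if both spawned tasks end up queued behind one blocked
    // worker while the other worker stays parked and is never notified,
    // nobody ever reaches the barrier a third time.
    barrier.wait();
}
```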
Looking into the deadlock scenario more closely (using loom's checkpoint feature), I have a suspicion about what is happening.
Thanks for looking into this. In theory, when Worker A finds a task, it transitions out of searching. If it is the last searching worker, it notifies a sleeping worker to wake up and try searching. In theory, Worker B would wake and process the task. However, that logic is (intentionally) racy. Here, before parking, you can see a safeguard that handles this case. When a worker transitions out of searching because it finds work, the intent is that once the task is done being polled, the worker will find the rest of the work, thus mitigating the race. In this loom test, the task blocks forever, so the runtime deadlocks.

Unfortunately, there isn't much we can do in this specific case. If we want to make the runtime "bulletproof," the best strategy would be to add some logic that detects blocked tasks, reports a warning, and pokes the runtime to get it unstuck. We may want to rewrite the original flaky test to avoid blocking the runtime.

Thanks for looking into it, though. I hope you found it a good learning experience. I'm also happy to answer any further runtime-related questions.
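A conceptual sketch of the searching hand-off described above (illustrative only; the types and names here are not tokio's actual internals):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Conceptual sketch only: a counter of workers currently searching for work.
struct Idle {
    num_searching: AtomicUsize,
}

impl Idle {
    // Called when a searching worker finds a task. Returns true if it was the
    // last searching worker, in which case the caller should wake a parked
    // worker so that any remaining work is still picked up.
    fn transition_out_of_searching(&self) -> bool {
        self.num_searching.fetch_sub(1, Ordering::SeqCst) == 1
    }
}
```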
Yes, I suppose this code is just for testing purposes, so I'm sure that it would be a different scenario with an ideal async task that does not block for a long time.
I agree. @jofas Are you still interested in this issue? (If not, I can do that test fix.)
I'd be interested in fixing the test, but I'm not exactly sure how we can determine the injection queue depth without blocking the two workers so that they can't consume any tasks from it. My initial idea was to keep the workers busy by continuously filling their LIFO slots with tasks so that they never start consuming the injection queue, but I don't know if that would work, and even then it sounds a bit brittle to me. My second idea was to have only a single worker, block it like we already do, and then fill up the injection queue, but I'm not sure that would be considered a good enough test for the multithreaded runtime. How would you fix the test, if you don't mind me asking?
I think it's okay to say that if the test doesn't succeed within some timeout, you just restart the test. Then retry up to, let's say, 10 times. (You probably have to get the blocking tasks to exit to restart the test.)
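Roughly something like this (a sketch of the idea only, not the fix that was eventually merged; `run_one_attempt` is a hypothetical stand-in for a single execution of the flaky test body):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn run_with_retries(max_attempts: usize, timeout: Duration) {
    for attempt in 1..=max_attempts {
        let (done_tx, done_rx) = mpsc::channel();

        // Run one attempt on its own thread so we can bound how long we wait.
        thread::spawn(move || {
            run_one_attempt();
            let _ = done_tx.send(());
        });

        match done_rx.recv_timeout(timeout) {
            // The attempt finished before the timeout: the test passes.
            Ok(()) => return,
            // Timed out: the blocking tasks from this attempt would have to
            // be released before retrying, as noted above.
            Err(_) => eprintln!("attempt {attempt} timed out, retrying"),
        }
    }
    panic!("test did not complete within the allowed number of attempts");
}

fn run_one_attempt() {
    // Hypothetical placeholder for the real test body.
}
```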
I'm guessing this should be closed now that we merged #6916? |
I agree. Should this discussion be summarized somewhere more prominently, in case people actually trigger the deadlock in their codebase? |
We already have open bugs about tolerating blocking tasks. |
This PR adds a Loom test for the deadlock observed in #6847.
When I run this test locally on my machine, I get an error which I believe signifies that the test is able to successfully replicate the deadlock.
I used oneshot channels instead of the barriers used in the flaky test where the deadlock was first observed, because Loom currently does not support Barriers. I'm opening this up as a draft PR because I'm looking for early feedback on whether I'm on the right track here or if I have misunderstood the assignment.